Configurable microprocessor architecture incorporating direct execution unit connectivity

ABSTRACT

A highly configurable and scalable microprocessor architecture designed for exploiting instruction level parallelism in specific application code. It consists of a number of execution units with configurable connectivity between them and a means to copy data through execution units under software control.

TECHNICAL FIELD

The present invention is in the field of digital computing systems. In particular, it relates to the internal architecture of a configurable microprocessor system.

BACKGROUND ART

Much of modern microprocessor design is focused on achieving higher levels of parallelism in instruction execution. This increases the throughput of the processor at a given clock frequency. Moreover, in the context of embedded systems where power consumption is often a significant consideration, it allows the same level of performance at a lower clock frequency and thus saves power. A key problem in achieving high levels of parallelism is the design of a centralized register file.

As the level of parallelism in the instruction stream increases so does the number of access ports required to a centralized register file. They are required to provide operands to and write back results from all the active functional units. The complexity of the register file grows at approximately N³ where N is the number of access ports. The register file soon becomes the bottleneck in the design and starts to have a strongly detrimental effect on the maximum clock speed.

This scalability issue is compounded by the need to provide an extensive network of feed-forward buses between the various access ports. Register read and write operations are typically performed in different stages of the execution pipeline. However, in order to achieve high code performance it is a requirement that an instruction can pass its results on to an immediately following instruction. Such an instruction is executed just one clock cycle later (presuming the instruction only takes one clock cycle to perform). This requires that the register file can detect reads and writes being performed on the same clock cycle to the same register and provide special forwarding buses to directly transfer the data to the reading unit without having to write to the register file first. Given the number of access ports and the requirement that every write port has to be compared against every read port, this creates a very challenging circuit design. Moreover, it is within the critical path of the processor pipeline and has a direct impact on the maximum clock frequency for the whole processor.
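To give a feel for the scaling described above, the following Python sketch tabulates the approximate N³ growth in register file complexity together with the number of write-port against read-port comparisons needed for forwarding. The port budget (two read ports and one write port per active functional unit) and all function names are illustrative assumptions and are not figures taken from this disclosure.

```python
def relative_register_file_cost(ports: int) -> int:
    """Relative complexity, using the approximate N^3 growth stated in the text."""
    return ports ** 3


def forwarding_comparators(read_ports: int, write_ports: int) -> int:
    """Every write port has to be compared against every read port."""
    return read_ports * write_ports


def port_counts(units: int, reads_per_unit: int = 2, writes_per_unit: int = 1):
    """Assumed budget: each active unit reads two operands and writes one result."""
    return units * reads_per_unit, units * writes_per_unit


for units in (1, 2, 4, 8):
    reads, writes = port_counts(units)
    total = reads + writes
    print(units, total, relative_register_file_cost(total),
          forwarding_comparators(reads, writes))
```

Under these assumptions, doubling the number of active units doubles the port count, multiplies the relative register file cost by eight and quadruples the number of forwarding comparisons.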

Some Very Long Instruction Word (VLIW) architectures have adopted a clustered approach to help alleviate this issue. In this model the functional units are partitioned into clusters, each having a private register file. Communication between clusters requires one additional clock cycle of latency. Thus performance suffers if there is significant communication between clusters. Code generation for such machines seeks to minimise the number of data transfers between clusters.

Another approach is that undertaken within the field of Transport Triggered Architectures (TTA). Code for TTAs controls transports rather than operations. That is, the instruction set specifies how data items are moved around the machine to different functional units. It is transport rather than operation centric in nature. By explicitly managing the transport of data between functional units and the register file, a TTA is able to reduce the total number of access ports required to the register file. Moreover, a TTA explicitly schedules the transport of data over feed-forward buses and thus avoids the need for complex register number comparison logic.

SUMMARY OF INVENTION

The disclosure describes a processor microarchitecture targeted at use in embedded systems where there is significant repetition of the code sequences that are executed by the processor. The microarchitecture is designed to be highly configurable in order to support an automated processor generation method. Such a method analyses application software and automatically architects a processor architecture with functional unit and connectivity resources that reflect the requirements of the key code sequences within the application software. The disclosure provides a highly configurable and scalable microarchitecture to support such a design trajectory.

A two-tier register file structure is used. There is a main register file but it has a very limited number of access ports. The code generator seeks to minimise the number of register file accesses by passing data values directly between functional units and intermediate holding registers without passing them through the register file. Moreover, reads and writes to the register file are explicitly generated by the code generator like any other operation. The register file is treated like any other functional unit in the processor and has no special status.

Each functional unit has output registers for holding its results. Operands for functional units are obtained via multiplexers that select results from a number of different result registers. The execution words include the selection settings for these multiplexers on each clock cycle. Thus, rather than specifying a register number where an operand is to be read or written, the execution word specifies the bus on which a particular data item is available. The code generator is aware of the structure of the buses in the processor and controls them alongside the functional units themselves. The majority of data values are passed from functional unit to functional unit without even passing through the register file.

If every functional unit could read from any result register then the problems of the centralized register file would return, due to the level of connectivity to the multiplexers. Connectivity in the architecture is generally minimized and focused on the connections that provide the most impact on overall performance. Thus certain functional units that are not directly connected may have to communicate data. To support this, certain functional units are able to copy data from their input operands to their outputs. That way data can be transported around the functional units as required using copies through functional units.

The microarchitecture also includes a branch mechanism that allows the actual execution of a branch to be decoupled from the point of branch issue, using relatively simple hardware mechanisms. It allows the microarchitecture to choose from one of a number of issued branches to actually execute. This can be used to reduce the number of branches performed and the disruption caused to the execution pipeline by the execution of such branches.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 illustrates the copying of data through a functional unit.

FIG. 2 shows the general architecture of a functional unit and its connectivity to other blocks within the architecture.

FIG. 3 shows an example logical layout of functional units.

FIG. 4 illustrates how the execution word is used to control the state of operand multiplexers in the architecture in order to control data flow in the system.

FIG. 5 shows the decomposition of the Next Region Address into its constituent components.

FIG. 6 provides an illustration of how execution words are formed into regions that have particular control flow relationships.

FIG. 7 shows an overview of the internal architecture of the branch control unit.

FIG. 8 illustrates how the execution word can be broken into a number of different groups, each of which can be used for control of a particular functional unit.

FIG. 9 provides an overview of the components within a functional unit.

FIG. 10 also provides an overview of the components within a functional unit and also provides information about the data and control connectivity between the components.

FIG. 11 provides an overview of the internal architecture of a conditional functional unit controller.

FIG. 12 provides an overview of the internal architecture of an unconditional functional unit controller.

FIG. 13 provides an overview of the internal architecture of an operand selector.

FIG. 14 provides an overview of the internal architecture of the delay pipeline.

FIG. 15 provides an overview of the internal architecture of the output bank unit.

FIG. 16 illustrates the data flow between various pipeline stages due to interactions between functional units of differing latencies.

FIG. 17 illustrates a timeline showing data flow between functional units in the data path of an example processor.

FIG. 18 illustrates a timeline of the events that occur at the end of a region execution that allow execution of a new region to be initiated.

FIG. 19 provides a state transition diagram of the states within the branch unit.

DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENT

This disclosure describes the underlying microarchitecture of the preferred embodiment. It shows how instructions are fetched, decoded and directed towards the appropriate execution unit. It also shows how the branch control mechanisms are implemented.

The philosophy of the microarchitecture is significantly different from contemporary RISC and VLIW architectures. These architectures tend to be very operation centric in their nature. The instruction set consists of several different operations that are executed on one of a number of execution units. Each of these instructions reads operands from the central register file and writes all results back to the same central register file. The instruction format consists of the specification of the operation and the register file location of the operands and result. The programmer does not specify the buses that are used to transport data to and from the execution units. Indeed, these buses are architecturally invisible at the instruction level. In a highly pipelined architecture the bus structures are actually very complex as multiple bypass paths also have to be present to allow the register file to be pipelined. The register file itself is a central bottleneck of the architecture that needs to be connected via buses to all execution units in the system. To support multiple parallel operations it also needs to support many simultaneous read and write access ports.

As feature sizes of modern VLSI technology are reduced, the distance that can be spanned over the chip in a single cycle is rapidly reducing. Wire propagation delays are starting to dominate over gate delays. The buses that connect systems together within a processor are starting to become much more important to the overall performance of the system. This is not sufficiently reflected in the architectural design of processors.

The preferred embodiment is a highly communication orientated architecture. It is the position of the bits in the execution word that specifies which operation should be performed. The bits themselves explicitly specify which buses should be used to transport operand data into an execution unit. All data buses in the architecture are under explicit software control. There are no hidden, bypass or feed-forward buses.

Although the architecture does have a central register file it is treated like any other implicit functional unit. All accesses to the register file have to be explicitly scheduled as separate operations. Since the register file acts like any other functional unit its bandwidth is limited. The code is constructed so that the majority of data values are communicated directly between functional units without being written to the register file.

Given the requirement to make the architecture highly scalable, communication of all data through a centralised register file is not a viable architectural option. Whenever a functional unit generates a result it is held in an output register until explicitly overwritten by a subsequent operation issued to the unit. During this time the functional unit to which the result is connected may read it.

A single functional unit may have multiple output registers. Each of these is connected to different functional units or functional unit operands. The output registers that are overwritten by a new result from a functional unit are programmed as part of the execution word. This allows the functional unit to be utilised even if the value from a particular output register has yet to be used. It would be highly inefficient to leave an entire functional unit idle in order to preserve the result latched on its output. In effect each functional unit has a small, dedicated, output register file associated with it to preserve its results.

An example functional unit array is given in FIG. 3. The register file unit 301 is placed at the centre with other functional units 302 placed as required by the application of the processor architecture. Given the connectivity limitations of the functional unit array, not every unit is connected to every other. Thus in some circumstances a data item may be generated by one unit and need to be transported to another unit with which there is no direct connection. The placement of the units and the connections between them is specifically designed to minimise the number of occasions on which this occurs. The interconnection network is optimised for the data flow that is characteristic of the required application code. The microarchitecture also includes an instruction cache. It stores a subset of the code used to control the operation of the functional units. A new execution word is fetched on each clock cycle and distributed throughout the functional unit array in order to orchestrate the issuing of operations and the steering of data between functional units.

To allow the transport of such data items, any functional unit may act as a repeater. That is, it may select one of its operands and simply copy it to its output without any modification of the data. Thus a particular value may be transmitted to any operand of a particular unit by using functional units in repeater mode. A number of individual “hops” between functional units may have to be made to reach a particular destination. Moreover, there may be several routes to the same destination. The code generator selects the most appropriate route depending upon other operations being performed in parallel.

There are underlying rules that govern how functional units can be connected together. Local connections are primarily driven by the predominant data flows between the units. Higher level rules ensure that all operands and results in the functional unit array are fully reachable. That is, any result can reach any operand via a path through the array using units as repeaters. These rules ensure that any code sequence involving the functional units can be generated. The performance of the code generated will obviously depend on how well the data flows match the general characteristics of the application. Code that represents a poor match will require much more use of repeating through the array.
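The reachability rule and the use of repeaters can be illustrated with a small graph search. The Python sketch below is only an illustration under stated assumptions: the unit names and connections are hypothetical, connectivity is modelled per unit rather than per operand, and a fewest-hops search stands in for the route selection of the code generator, which in practice also depends on the other operations being performed in parallel.

```python
from collections import deque

# Hypothetical connectivity: each unit maps to the units whose operands
# can read its output registers directly.
CONNECTIONS = {
    "regfile": ["alu0", "mem"],
    "alu0": ["alu1", "regfile"],
    "alu1": ["mul", "alu0"],
    "mul": ["regfile"],
    "mem": ["regfile", "alu1"],
}


def shortest_route(connections, source, target):
    """Breadth-first search for a route with the fewest hops, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in connections.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


def fully_reachable(connections):
    """True if a result from every unit can reach every other unit."""
    units = list(connections)
    return all(
        shortest_route(connections, a, b) is not None
        for a in units for b in units if a != b
    )


print(shortest_route(CONNECTIONS, "regfile", "mul"))
print(fully_reachable(CONNECTIONS))
```

In the route found from "regfile" to "mul", the intermediate units on the path are the ones that would operate in repeater mode.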

Region Based Execution

In the preferred embodiment all execution is performed within blocks of code called regions. This simplifies the implementation of both the instruction scheduling and the control mechanisms in the hardware.

A region is a block of code that only has a single entry point but potentially many exit points. The analysis performed by the code generation tools is able to form groups of basic blocks into regions. Regions are often used as the basic arena in which global scheduling optimisations are performed. Global scheduling refers to the movement of instructions across branches as well as within individual basic blocks.

In the architecture, regions are always executed fully. If the region contains a number of internal branches to basic blocks outside of the region then they are not resolved until the end of the region is reached. The compiler constructs the regions from basic blocks so that they contain the most likely execution paths through the basic blocks. A region is able to perform a multi-way branch to select one of a number of different successor regions.

FIG. 6 illustrates an example set of regions 601 and the relationships between them. It shows the execution of the individual basic blocks 603, 604 and 605 within each region. The regions themselves are composed of individual execution words 602. The set of control edges 606 from each region shows the possible successors for each region.

Execution Word Representation

The preferred embodiment uses a Very Long Instruction Word (VLIW) format. This enables many parallel operations to be initiated on a single clock cycle, enabling significant parallelism. The actual width is configurable. Shorter widths tend to be more efficient in terms of code density but poorer in extracting parallelism from the application.

The instruction format is not fixed either and is dependent upon the execution units the user defines for a particular processor. Unlike many contemporary VLIW architectures, the architecture uses a flat decode structure. This means that a particular execution unit is always controlled from a specific group of bits in the execution word. This makes the instruction decoding for the architecture very straightforward. Other VLIWs tend to bundle a number of independent operations into a single instruction word. They still require quite complex decode logic to direct different operations to the appropriate execution units.

FIG. 4 illustrates the basic instruction decode and control paths of the processor. The instruction memory 404 holds the representation of the operations in the customized format for the processor. A new execution word is fetched on each clock cycle. Each block of bits 405 in the execution word is used for controlling a particular execution unit 401.

The bits in the execution word are used to control multiplexers 406 that direct data from the interconnection network to the operand inputs of the execution unit. Results from the execution units are routed back to the interconnection network to be used by subsequent operations.

A branch control unit 402 allows the architecture to execute new blocks of code by loading a new value in the PC (Program Counter) 403. If a branch is not executed then the PC is just incremented on each cycle to execute code sequentially from the instruction memory.

The code is stored in 32-bit wide words in main memory and transferred to a wider instruction cache prior to actual execution. The instruction cache has a certain capacity to allow particular code loops to remain cached without requiring continuous access to main memory. The wider instruction buffer can be configured in size to support power consumption and area goals.

All bits within the execution word have positional context. There is a direct relationship between particular bits and the functional units that they control. This greatly simplifies the execution word decoding task. The appropriate bits for a particular functional unit are simply routed from the execution word as required.
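The flat, positional decode described above amounts to slicing fixed bit ranges out of the execution word. The following Python sketch illustrates this under assumed values: the layout table, the unit names and the field widths are hypothetical, since the actual format depends on the functional units configured for a particular processor.

```python
# Hypothetical layout: unit name -> (least significant bit, width in bits).
LAYOUT = {
    "alu0": (0, 8),
    "mul": (8, 6),
    "regfile": (14, 10),
}


def extract_group(execution_word: int, lsb: int, width: int) -> int:
    """Route a fixed slice of the execution word to one functional unit."""
    return (execution_word >> lsb) & ((1 << width) - 1)


def decode(execution_word: int, layout=LAYOUT) -> dict:
    """Flat decode: every unit always reads the same bit positions."""
    return {
        unit: extract_group(execution_word, lsb, width)
        for unit, (lsb, width) in layout.items()
    }


word = (0x2A5 << 14) | (0x15 << 8) | 0xC3
print(decode(word))
```

No field of the word needs to be inspected to find out where another field lies, which is what keeps the decode logic simple.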

The basic structure of the execution word is illustrated in FIG. 8. The execution word 801 is subdivided into a number of groups 802. Each group 803 controls one or more functional units 804. The number of bits required to control a given functional unit is related to the number of selectable sources for the unit's operands and the number of output registers for its results. As these numbers increase, the number of bits required to uniquely specify them grows.
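As a rough illustration of how the width of a group might be estimated, the Python sketch below adds up the bits for each field. It assumes a binary-encoded selector per operand, one mask bit per output register and binary-encoded method and opcode fields; these encodings and the function names are assumptions made for the example and are not specified in this passage.

```python
def select_bits(choices: int) -> int:
    """Bits needed to pick one of `choices` alternatives (0 if there is only one)."""
    return (choices - 1).bit_length()


def control_bits(sources_per_operand, output_registers, methods=1, opcode_bits=0):
    """Estimated group width for one functional unit."""
    operand_bits = sum(select_bits(n) for n in sources_per_operand)
    mask_bits = output_registers  # assumed: one mask bit per output register
    return operand_bits + mask_bits + select_bits(methods) + opcode_bits


# Two operands with 4 and 2 selectable sources, 3 output registers,
# 4 methods and a 2-bit opcode.
print(control_bits([4, 2], 3, methods=4, opcode_bits=2))
```

An operand fixed to a single bus contributes no selection bits, which matches the case where no multiplexer is required.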

If a group controls more than one functional unit then bits within the group may be shared. A selection code is then used to indicate which particular functional unit is selected on each clock cycle. This overlay mechanism allows direct trade-off between code density and parallelism (i.e. performance) independently of the functional unit selections. A narrow execution word forces more functional units to share groups. If more than one of those units could be utilised on a particular cycle then performance may be lost as only one may be selected. However, a narrow execution word increases the chances that all groups are usefully employed on each clock cycle and thus code density is improved. If a wider execution word is employed then greater parallelism is possible (as there is less sharing within each group) but groups are more likely to go unused on any particular clock cycle.

The whole of the execution word is used for controlling functional units apart from one bit. This is the End Region Flag (ERF) and is used to indicate that the last execution word in a region has been reached.
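The role of the End Region Flag can be sketched as a simple fetch loop. In the Python example below, the position of the flag (the top bit of a 32-bit word), the memory model and the `issue` callback are all assumptions made for illustration; the selection of the successor region by the branch control unit is not modelled.

```python
ERF_BIT = 31  # assumed position of the End Region Flag in a 32-bit word


def execute_region(memory, pc, issue):
    """Issue words sequentially from pc until one carries the ERF.

    Returns the address following the last execution word of the region.
    """
    while True:
        word = memory[pc]
        # All remaining bits control the functional units.
        issue(word & ~(1 << ERF_BIT))
        pc += 1
        if (word >> ERF_BIT) & 1:
            return pc


memory = [0x1, 0x2, 0x3 | (1 << ERF_BIT), 0x4]
issued = []
end = execute_region(memory, 0, issued.append)
print(issued, end)
```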

Functional Units

The microarchitecture includes a configurable number of functional units. Each of those functional units performs a particular operation upon a number of data operands to produce a number of results. The functional units are pipelined and units with different latencies may be freely mixed in the microarchitecture. The functional unit types and connectivity may be configured as required. In the preferred embodiment this configuration is determined by an automated analysis that finds the functional unit mix and connectivity that is best matched to the requirements of the application code that the processor is to execute.

The internal architecture of a functional unit is shown in FIG. 2. The central core of a functional unit 203 is the execution unit itself 201. It performs the particular operation for the unit. The blocks surrounding it allow the functional unit to connect to other units and allow the unit to be controlled from the execution word 205.

Functional units are placed within a virtual array arrangement. Individual functional units can only communicate with near neighbours within this array. This spatial layout prevents the architectural synthesis generating excessively long interconnects between units that would significantly impact clock speed.

Fields within the execution word control the operand multiplexers 206. These are responsible for selecting the correct operands 202 to present to the execution unit. In some circumstances the operand may be fixed to a certain bus, removing the requirement for a multiplexer. The number of selectable sources and the choice of particular source buses are completely configurable. The control input 207 determines the type of operation to be performed.

All results from an execution unit are held in independent output registers 204. These drive data on buses connected to other functional units. Data is passed from one functional unit to another in this manner. The output register holds the same data until a new operation is performed on the functional unit that explicitly overwrites the register.

The functional units represent the building blocks of the processor. The selection of functional units represents the basic configurability of the architecture. Functional units may be selected as required for a particular application domain. The connections between the functional units form them into constituent components of a fully programmable processor. Individual functional units may be replicated as required in order to exploit parallelism in the software targeted at the processor.

Embedded within the functional unit is the execution unit. This is the block that actually performs the required operations. The execution unit is surrounded by additional logic that allows the execution unit to be controlled by software as part of a processor and to communicate with other functional units. All inputs to the execution unit are selected from a number of data buses. These buses communicate data between the individual units within the processor. Outputs from the execution unit are latched and then driven over a data bus for use by another functional unit. Each functional unit is also embedded with some control logic to allow the unit to be controlled from the execution word of the processor. A method operand is extracted that selects which particular operation the execution unit should perform.

A detailed architecture overview of a functional unit is shown in FIG. 10. This shows the internal connectivity between the constituent blocks. Both data signals 1012 and control signals 1013 are shown. The diagram shows an execution unit with two operand ports and one output port. However, the number of both input and output ports is completely configurable.

The constituent blocks are as follows:

Execution Unit: The execution unit 1001 receives operand data values from the operands 1014. These operands are fed by operand selectors 1002. In general a particular operand will have multiple potential data sources 1003. However, if only one data source is required then an operand port may be directly connected to the external data bus, avoiding the need for an operand selector. A method selection 1015 is obtained from the controller unit. This is extracted directly from the execution word 1004 but is delayed by one clock cycle by the controller so its presentation is in synchronization with the associated operands. The select flag 1016 is asserted if the execution unit has been selected to perform a new operation during the clock cycle. If the flag is false then the method and operand inputs are undefined. The execution unit generates a number of results 1017.

Controller: The controller 1007 reads the opcode portion of the execution word and compares the code against the fixed selection code 1008 for the unit. If there is a match then the unit is being selected. The predicate mask 1005 shows the status of various conditions. The predicate condition to which the operation belongs is specified as part of the execution word. An operation is only performed if that predicate is true. Finally, the wait flag 1006 is used to indicate a pipeline stall and is used to prevent further operations being issued to the unit.

Operand Selector(s): An operand selector 1002 is simply a multiplexer for selecting one of a number of data inputs from data buses. A portion of bits from the execution word is used to specify the bus to be selected. These bits are registered so that they are delayed by one clock cycle. This causes the data steering to be performed a cycle later than the execution word distribution, as is required.

Delay Pipeline: The delay pipeline 1010 simply delays the output register mask by a number of clock cycles. The delay period is one less than the latency of the execution unit. Thus if the execution unit has a latency of one then no delay pipeline is required. This allows the correct output registers to be updated when the results from an operation are available.

Output Registers: The number of output registers 1009 is equal to the number of output connections 1011 from the unit. As the execution unit generates results, they are registered in one or more of the output registers. An output register mask is included in the execution word and specifies which registers should be latched for any given operation. New data from the execution unit is only registered if the unit is producing a valid output during that cycle. Data remains in the output register until explicitly overwritten by subsequent operations performed on a functional unit.

FIG. 9 shows greater detail of the control plane for a functional unit. The area 903 shows the bits within the execution word that are used to control the functional unit. This is composed of a number of different sections for describing different aspects of the operation to be performed by the functional unit. This field is formed from a subset of bits from an overall execution word used to control all of the functional units within the architecture.

The sub-field 904 controls which of the output registers should be updated with a result. The sub-field 905 is used to control the operand multiplexers 911 to select the correct source of data for an operation. This data is fed to the execution unit via the operand inputs 909. The sub-field 906 provides the method that should be performed by the execution unit and is fed to the unit via the input 910.

The optional sub-field 907 provides the number of a predicate flag which controls the execution. This is used to select the corresponding status bit from a predicate status mask 912 via the multiplexer. This is a global state accessible to all functional units that indicates dynamically which instructions should be completed. This is used to condition the functional unit select so that if a particular instruction is disabled then the functional unit operation is not performed. Certain functional units may be executed unconditionally and do not require such a field.

The opcode bits 908 are used to select the particular functional unit. If the opcode does not have the required value then all of the other bits are considered to be undefined and the functional unit performs no operation.

Controller Unit

The controller glue unit is responsible for generating the unit select signal, the result selector and the output register mask. The unit select signal is asserted if a new operation is being initiated on the execution unit during the cycle. The output register mask is used to control the registering of new data in the output registers of the functional unit.

There are two distinct types of controller unit depending upon whether execution is conditional on a particular predicate. If an execution unit has no side effects then it can be executed unconditionally as a speculative use of the unit will not have any permanent side effects. Units that result in side effects (such as any unit with internal state such as a register file or memory unit) must always be executed conditionally. Unconditional units can have a more compact representation than conditional units as no predicate number needs to be specified as part of the execution word.

Conditional Controller

FIG. 11 shows the internal architecture and connectivity for a conditional controller glue unit. The figure shows both data signals 1115 and control signals 1114. A conditional controller 1101 is used when the execution unit has internal state so that certain methods cannot be executed speculatively. A predicate selector field 1106 is used to select 1107 the appropriate bit from a predicate status mask 1105. This is used to gate both the unit select 1109 and the output register mask 1110. An incoming opcode 1103 is compared against a fixed selection code 1104 using the comparator 1108. The output of this is also used to gate the unit select 1109 and output register mask 1110.

The register 1102 is used to delay the unit select 1109 so that it is valid during the execute cycle of the execution unit. Another register is used to hold the conditioned form of the output register mask 1113. This ensures that the output registers are not updated if a unit is not selected during a cycle. Finally the method 1112 is simply delayed by one clock cycle to generate the method 1111 that is in synchronization with the other signals.
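The gating performed by the conditional controller can be summarised in a short behavioural model. The Python sketch below is an illustration only: the class and parameter names are hypothetical, signals are modelled as plain integers and booleans, and the wait flag is not modelled.

```python
class ConditionalController:
    """Behavioural sketch of a conditional controller (hypothetical names)."""

    def __init__(self, selection_code):
        self.selection_code = selection_code
        # Registered outputs: unit select, conditioned mask, method.
        self.registered = (False, 0, 0)

    def clock(self, opcode, predicate_select, predicate_mask, mask, method):
        """Advance one cycle; return the values decoded on the previous cycle."""
        out = self.registered
        opcode_match = opcode == self.selection_code
        predicate_true = bool((predicate_mask >> predicate_select) & 1)
        select = opcode_match and predicate_true
        # The mask is gated by the unit select; the method is only delayed.
        self.registered = (select, mask if select else 0, method)
        return out


ctrl = ConditionalController(selection_code=0b101)
print(ctrl.clock(0b101, 2, 0b0100, 0b11, 7))  # reset values appear first
print(ctrl.clock(0b101, 1, 0b0100, 0b11, 3))  # previous word: selected
print(ctrl.clock(0b000, 2, 0b0100, 0b01, 1))  # previous word: predicate false
```

An unconditional controller would follow the same pattern with the predicate test removed.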

Unconditional Controller

FIG. 12 shows the internal architecture of an unconditional controller glue unit. Both data signals 1210 and control signals 1211 are shown. An unconditional controller 1201 is used when the execution unit has no internal state so all methods can be performed speculatively. The opcode 1202 is compared against the fixed selection code 1203 using the comparator 1212. This generates a signal that conditions the unit select 1204 and output register mask 1205. The unit select is registered 1207 to generate a unit select during the execute cycle of the execution unit. The output register mask 1208 is conditioned and registered to produce the value 1205 in the correct clock cycle. The method field 1209 is simply registered to generate the method 1206 in the correct execution cycle.

Operand Selector Unit

FIG. 13 shows the basic structure of an operand selector unit. An operand selector 1301 is primarily composed of a multiplexer 1303 that directs the contents of a particular data bus 1302 to the output operand 1304. The output operand is then fed to the input of an execution unit. The switching of the multiplexer is performed during the first execution cycle of the execution unit. The multiplexer is controlled directly by specific bits in the execution word 1306. These are distributed during the decode cycle and are held in the register 1305 until the first execution cycle.
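The one-cycle offset between distributing the selection bits and steering the data can be modelled as a registered multiplexer. The following Python sketch is a behavioural illustration with hypothetical names; bus values are plain integers and the reset value of the select register is assumed to be zero.

```python
class OperandSelector:
    """Multiplexer whose select bits are registered for one cycle."""

    def __init__(self):
        self.select = 0  # register holding the select bits (assumed reset value)

    def clock(self, select_bits, buses):
        """Steer data using the select latched on the previous cycle,
        then latch the select bits distributed on this cycle."""
        operand = buses[self.select]
        self.select = select_bits
        return operand


selector = OperandSelector()
print(selector.clock(2, [10, 20, 30]))  # uses the reset select (bus 0)
print(selector.clock(1, [11, 21, 31]))  # uses select 2 from the previous cycle
print(selector.clock(0, [12, 22, 32]))  # uses select 1 from the previous cycle
```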

Delay Pipeline Unit

The delay pipeline simply delays the Output Register Mask bits so that they are available on the clock cycle during which the results from the execution unit are generated. The delay pipeline is only required if the latency of the execution unit is greater than one clock. The delay required in the pipeline is one less than the latency of the execution unit.

The architecture of the delay pipeline is shown in FIG. 14. The delay pipeline 1401 contains a number of register stages 1404 that delay the input 1402 relative to the output 1403.
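
As an illustration of the rule that the delay is one less than the latency, the following Python sketch models the register stages as a queue. The class name `DelayPipeline` and the method `clock` are assumptions for this example. A unit with a latency of one gets no stages and the mask passes straight through.

```python
from collections import deque


class DelayPipeline:
    """Delays output register mask bits by (latency - 1) clock cycles."""

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("latency must be at least one clock cycle")
        # One register stage per cycle of delay, all initially cleared.
        self.stages = deque([0] * (latency - 1))

    def clock(self, mask_in):
        # With no stages (latency of one) the mask is not delayed at all.
        if not self.stages:
            return mask_in
        self.stages.append(mask_in)
        return self.stages.popleft()
```

For a unit with a latency of three, a mask presented on one clock emerges two clocks later, in step with the result it controls.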

Output Registers Unit

The output registers unit holds results from a particular result port of an execution unit. The architecture is shown in FIG. 15. The output registers unit 1501 may drive a number of connection buses to the operands of other functional units 1503. Each bus has an associated register 1502. The output register mask 1505 (specified as part of the execution word) determines which particular registers are updated by an operation executed on the unit. The data 1506 is obtained from the execution unit. An optional test chain 1504 allows the state of the registers to be read and written by a debug system.
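
The masked update of the output registers may be sketched as below. The class name `OutputRegisters` and the convention that bit i of the mask enables register i are assumptions made for illustration. The optional test chain is not modelled.

```python
class OutputRegisters:
    """One register per connection bus, updated under an output register mask."""

    def __init__(self, count):
        self.regs = [0] * count

    def clock(self, mask, data):
        # Assumption: bit i of the mask enables an update of register i.
        for i in range(len(self.regs)):
            if (mask >> i) & 1:
                self.regs[i] = data
```

A mask of zero leaves every register unchanged, which is how a result is held for a later consumer.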

Copying Operands

Certain functional units are able to perform a copy operation as a side effect of their normal operation. Copying functionality is required to ensure that all the operands in the processor are fully reachable from all the results. That is, it is possible to move any result to any operand unit via a sequence of copy operations if no direct physical connection is available between the units. The processor architecture is configured so that this is the case.

This copy mechanism has the advantage that it makes use of a side effect of the unit's operation to perform a copy. Thus very little additional logic is required to support the copying functionality.
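
The reachability requirement can be illustrated by a short Python sketch that searches for a sequence of copy operations between two units. This is one possible way a tool might find such a sequence, not a method taken from the description. The function name `copy_route`, the dictionary representation of the connectivity, and the unit names in the usage note are all hypothetical.

```python
from collections import deque


def copy_route(connections, source, target):
    """Return the shortest chain of units from source to target, or None.

    connections maps each unit to the units whose operands it can drive.
    """
    if source == target:
        return [source]
    previous = {source: None}
    queue = deque([source])
    while queue:
        unit = queue.popleft()
        for nxt in connections.get(unit, ()):
            if nxt in previous:
                continue
            previous[nxt] = unit
            if nxt == target:
                # Walk back through the predecessors to rebuild the route.
                path = [nxt]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

For a connectivity of `{"alu": ["shift"], "shift": ["mem"], "mem": ["alu"]}`, a result from `"alu"` reaches `"mem"` by one intermediate copy through `"shift"`.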

FIG. 1 provides an overview of the copying mechanism. The mechanism relies on the fact that operating with the value 0 is an identity operation for a large number of unit types. The example shows the use of an adder unit 101 for providing a copy but the technique is equally applicable to logical units and various other unit types. Addition of 0 to an operand copies the input operand to the result.

If a copy is to be performed from the upper operand then the lower operand is set to have a value of 0 as shown in 103. This is achieved by fixing the 0 selection for the operand selector to be tied to the literal value 0. The upper operand is added to 0 thus producing a copy of the input value on the output. Conversely if a copy is to be made of the lower operand then the upper operand is set to 0 as shown in 104. The input operands to the unit itself are shown as 102 and the copied results are held in the output registers 105. The operand multiplexers 106 are used to select the appropriate input data.

A special copy method is nominated in the definition of each functional unit that can be used for copies. Such a method must be able to use 0 to perform an identity operation if there are multiple input operands that may be copied.
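
The identity property on which the copy method relies can be checked with a small Python sketch. The table `COPY_METHODS` and the function `copy_via_unit` are hypothetical names. The choice of add, OR and XOR as example methods is an assumption, since the description names only the adder and refers generally to logical units.

```python
import operator

# Example two-operand methods for which 0 is an identity value.
COPY_METHODS = {
    "add": operator.add,
    "or": operator.or_,
    "xor": operator.xor,
}


def copy_via_unit(method, value, copy_upper=True):
    """Copy value through a unit by tying the other operand to literal 0."""
    op = COPY_METHODS[method]
    # The operand that is not being copied is selected as the literal 0.
    upper, lower = (value, 0) if copy_upper else (0, value)
    return op(upper, lower)
```

Each method returns the copied operand unchanged, whichever of the two operands is the one being copied.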

Pipeline Timing

This section describes the cycle level timing of various activities in the processor pipeline. The architecture is characterized by having a short and highly regular control pipeline but with the flexibility to allow functional units with arbitrary length internal pipelines to be included in the processor. Due to the partitioned nature of the control flow paths in comparison to the data paths, they have separate pipelines.

The control path uses a very simple three stage pipeline reminiscent of early RISC architectures. The three stages are fetch, decode and execute. During the fetch stage the next execution word is read from the instruction cache. During the decode stage the execution word is distributed to the functional units and the appropriate segments are decoded by the units. Finally, during the execute cycle the operations are presented and initiated in the appropriate functional units.

Each of the functional units has its own, independent, pipeline controlled from a master clock. The length of the execution pipeline for each unit is specified in the execution unit model as its latency. The code generator automatically takes account of the length of execution unit pipelines in the management of the data flow between functional units. The independent specification of the pipeline length for each execution unit allows great flexibility in the construction of the individual units. Each functional unit can generate a wait signal. If this is asserted then the entire pipeline of the processor is stalled. This allows the implementation of execution units that sometimes require an extended latency period. For instance, it can be used for cache memory units where the latency is longer if a particular data item is not present in the cache.

The short pipeline allows the branch that occurs at the end of a region to occur without any pipeline bubbles. The last execution cycle of a region can occur back-to-back with the first execution cycle of the succeeding one.

Instruction Timing

FIG. 16 provides a timeline of instruction execution in the architecture. During the fetch cycle 1601 the EWA (Execution Word Address) 1606 is used to address the instruction cache 1607 and obtain an execution word. During the decode cycle 1602 the appropriate bits from the word are distributed 1608 to all the functional units in the system. Each of these has comparison logic embedded within the controller glue unit to determine if an operation for that unit has been selected. If so then an operation on the functional unit is initiated. All functional units have at least one execute cycle 1603, 1604 and 1605. The data buses 1612 distribute results from functional units to the inputs of other functional units. Each functional unit has a defined latency. The pipelines of the functional units run independently and do not affect the timing of instruction fetching and decoding. The example shows a single cycle functional unit 1611, a multi-cycle functional unit 1610 and a memory unit 1609.

In a typical RISC pipeline all functional unit pipelines must be completed by a write back of results into a centralised register file. Thus the individual functional unit pipelines are intimately tied into the overall control pipeline of the processor as appropriate feed forward paths must be managed to feed data from register writes to subsequent register reads. Since the processor uses software to manage access to register files (they are treated like any other functional unit) the functional unit pipelines can be effectively separated from the overall fetch and decode pipeline. The code scheduling manages the data bus resources and ensures results are only read when they are available from the outputs from functional units.

Data Flow Timing

FIG. 17 illustrates the data flow timing of functional units in the preferred embodiment. It shows a particular dynamic data path through the functional units as results are passed from one unit to the next. Each clock cycle boundary is shown as 1705. The initial result is produced from a single cycle functional unit 1701. The result is calculated during cycle 1. It is latched by the output register in the unit at the end of cycle 1 and then driven onto the output bus during cycle 2. At the start of cycle 2 the result is steered into a two-cycle functional unit 1702 operand. It is operated upon during the remainder of that cycle and during cycle 3. At the end of cycle 3 it is latched into the output register and the result driven during cycle 4. It is then steered into another single cycle execution unit 1703. Finally it is held in the output register for an extra cycle while other operands for a subsequent operation become available. During cycle 6 the data item is written into a memory unit 1704.
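
The cycle numbers in this example follow from the unit latencies, as the Python sketch below shows. The function `schedule` and its `(name, latency, hold)` tuples are assumptions introduced for illustration, where `hold` is the number of extra cycles a result waits in the output register before the next unit consumes it. The memory write is treated as occupying a single cycle, which the description does not state.

```python
def schedule(stages, start_cycle=1):
    """Return (name, first_cycle, last_cycle) for each unit on a data path."""
    timeline = []
    cycle = start_cycle
    for name, latency, hold in stages:
        first = cycle
        last = cycle + latency - 1
        timeline.append((name, first, last))
        # The result is driven on the bus in the cycle after it is latched,
        # plus any extra cycles it is held in the output register.
        cycle = last + 1 + hold
    return timeline


# The path of FIG. 17: units 1701, 1702, 1703 and the memory unit 1704.
FIG17_PATH = [("1701", 1, 0), ("1702", 2, 0), ("1703", 1, 1), ("1704", 1, 0)]
```

Applying `schedule` to `FIG17_PATH` places unit 1702 in cycles 2 and 3, unit 1703 in cycle 4 and the memory write in cycle 6, in agreement with the description above.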

Region Succession Mechanism

The control mechanism for performing a region succession is illustrated in FIG. 18. Such a succession occurs when the end of a region is reached.

The destination address is determined prior to the end of the region and put into register 1812. A sufficient number of clock cycles are left between the resolution of the last potential branch in the region and the last execution word in the region (in which the ERF flag is set). This will leave a new instruction address available that has been looked up from the instruction cache.

The mechanism allows a flag to be set on the last instruction of a region (ERF) and to immediately initiate a succession so that the first instruction from the new region can be executed without any further latency.

The instruction 1804 is the last to be executed in a region. Thus it has the End Region Flag (ERF) set 1805 which is used to control a multiplexer 1811 that selects the next execution address from either the Execution Word Address (EWA) 1813 or the new address 1812. The next execution address is applied to the instruction cache 1810. This selection can be performed during the same cycle as the access itself, thus allowing very quick address steering. The EWA is incremented 1807 by one on each cycle 1814 so that execution is advanced through the region. Thus the first instruction 1808 of the new region is executed as EWA is loaded with the new address plus one. The ERF for the first instruction of the new region is reset 1806, causing the selection of the EWA pointing to the second instruction of the region. Thus instruction 1809 is the second to be executed from the new region.
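
The address steering may be illustrated by the simplified Python sketch below. It abstracts away the fetch, decode and execute overlap and assumes that the ERF of the word at one address directly controls the selection of the following address. The function name `fetch_sequence` and the addresses in the usage note are hypothetical.

```python
def fetch_sequence(erf_flags, start, new_address, cycles):
    """Return the execution word addresses applied to the instruction cache.

    erf_flags maps an address to True if its execution word has ERF set.
    """
    ewa = start
    take_new = False
    fetched = []
    for _ in range(cycles):
        # Multiplexer: choose the new region address or the incremented EWA.
        address = new_address if take_new else ewa
        fetched.append(address)
        take_new = erf_flags.get(address, False)
        # EWA is loaded with the selected address plus one.
        ewa = address + 1
    return fetched
```

With ERF set at address 12 and a new region at address 40, the sequence starting from address 10 is 10, 11, 12, 40, 41, 42, with no gap between the two regions.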

Each instruction consists of a fetch cycle 1801, a decode cycle 1802 and an execute cycle 1803.

Branch Control Unit

Branch operations may be issued to the branch unit. Branch operations only load the required destination information into the branch registers within the branch control unit. The actual branch is not performed until the end of the region is reached. Thus a multi-way branch is resolved at the end of the region execution.

The branch control unit determines which region will be executed next. The unit is able to handle multi-way branch conditions. A number of branch destinations with associated conditions may be issued in a region. The branch control unit determines which branch will be taken on the basis of which conditions evaluate to true and the relative priority of the branches.
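
The selection among several issued branches may be sketched as follows. The function name `resolve_branch` is hypothetical, and the encoding in which a larger number denotes a higher priority is an assumption, since the description does not define how priorities are represented.

```python
def resolve_branch(issued, fall_through):
    """Pick the destination of the highest priority branch that is true.

    issued is a list of (priority, condition, destination) tuples in issue
    order. Assumption: a larger priority number means a higher priority.
    """
    best = None
    for priority, condition, destination in issued:
        if condition and (best is None or priority > best[0]):
            best = (priority, destination)
    # With no true condition, execution falls through to the next region.
    return fall_through if best is None else best[1]
```

The result does not depend on the order in which the branches were issued, only on their conditions and priorities.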

Region Branch State (RBS)

The RBS is a register that holds the current state for a destination branch selected from a region. The RBS has three possible states as listed below:

Default: This is the default state that indicates that there should be a fall through to the following region when the execution of the current one is completed.

Restart: This indicates that the current region should be re-executed when the current execution is completed. The region address is obtained from the Region Base Address register.

Branch: Indicates that the branch control unit has selected a branch destination. Branches are given a static priority which may differ from their issue order. The branch issued with the highest priority and a true condition is selected.

The state transitions are detailed in FIG. 19. The initial state is Default 1901. The other possible states are Branch 1902 and Restart 1903. If a branch is selected 1904 then a transition is made from the Default to the Branch state. A return to the Default state is made if the end of the region is reached. Earlier branches may be selected 1905 while in Branch mode without changing the state. In some circumstances a data hazard may require a re-execution of a region. The transition 1909 is then made from Default state to Restart state to arrange for the region to be re-executed to resolve the hazard. If in Branch state and a region execution needs to be repeated due to a data hazard then a transition 1907 is made to the Restart state. Finally, if in Restart state and a branch earlier than the cause of a restart performs a branch then the transition is made to the Branch state.
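
The transitions listed above may be collected into a table, as in the following Python sketch. The enumeration `RBS`, the event names and the function `step` are hypothetical. The description does not say what happens for a state and event pair that it does not list, so the sketch leaves the state unchanged in that case. That behaviour is an assumption.

```python
from enum import Enum


class RBS(Enum):
    DEFAULT = "default"
    BRANCH = "branch"
    RESTART = "restart"


# Only the transitions named in the description are listed here.
TRANSITIONS = {
    (RBS.DEFAULT, "branch_selected"): RBS.BRANCH,
    (RBS.BRANCH, "end_of_region"): RBS.DEFAULT,
    (RBS.BRANCH, "earlier_branch_selected"): RBS.BRANCH,
    (RBS.DEFAULT, "data_hazard"): RBS.RESTART,
    (RBS.BRANCH, "data_hazard"): RBS.RESTART,
    (RBS.RESTART, "earlier_branch_selected"): RBS.BRANCH,
}


def step(state, event):
    """Return the next RBS state (unlisted pairs keep the current state)."""
    return TRANSITIONS.get((state, event), state)
```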

Region Base Address (RBA)

The RBA register holds the address of the start of the region currently being executed. It is loaded with the value of Next Region Address (NRA) at the start of each new region execution. It is used to generate the next value of NRA if a region is to be restarted.

Next Region Address (NRA)

The NRA register contains the address of the next region that is to be executed. The branch resolution unit calculates the NRA as each branch is issued. The highest priority branch is selected as the destination. The branch does not occur until the end of the current region is reached. The NRA consists of two fields. There is a full destination address that allows the specification of an address in main memory.

The use of the NRA is illustrated in FIG. 5. When the end of the region is reached the NRA register 501 is used to find the correct entry in the instruction cache and to set the initial predicate status bits. The address of the region within the instruction cache 502 is looked up 504 in the instruction cache and is then loaded into EWA, from where execution of the region is commenced. The lowest predicate to be set 503 is converted into a mask 505 showing which predicates are valid and then loaded into a predicate mask register 507.

Branch Resolution Architecture

This unit is responsible for selecting a destination address. The structure of the branch resolution unit is shown in FIG. 7. A branch destination address is supplied 707 by the branch functional unit. A multiplexer 711 selects between that address and the RBA 703 on the basis of the loop flag 706 supplied from the branch functional unit. This allows a branch to be issued that causes a branch to the start of the region without having to specify a destination address.

A Next Execution Address (NEA) 702 is used to hold the address of a following region to be executed in the absence of a branch being issued in a region.

A branch priority and predicate condition 708 is supplied from the branch functional unit. Multiplexer 711 selects the default state 709 if a loop is being performed. This is compared against the previously highest priority from the current Next Region Address (NRA) 704. If a branch priority is higher than a previously selected branch then it is used instead. A demultiplexer 710 is used to determine if the branch predicate 705 is true. The block 701 is responsible for maintaining the RBS state machine and selecting destinations as required. It supplies a squash vector 706 to a predicate control unit.

At the end of a region execution the NRA 704 holds the destination address that is supplied to the instruction cache via 707.

It is understood that there are many possible alternative embodiments of the invention. It is recognized that the description contained herein is only one possible embodiment. This should not be taken as a limitation of the scope of the invention. The scope should be defined by the claims and we therefore assert as our invention all that comes within the scope and spirit of those claims.

1. A microprocessor with an architecture incorporating several execution units, whereby: (a) one or more registers store results from particular execution units; (b) execution unit operands receive data from one such register; (c) certain execution units are able to copy data from their operands to result registers; and (d) the copy capability is used to allow execution units that are not directly connected to communicate data.
 2. The microprocessor according to claim 1 whereby one or more of the execution units may be register files.
 3. The microprocessor according to claim 1 whereby the set of registers associated with a particular execution unit to be written may be specified for each operation.
 4. The microprocessor according to claim 3 whereby the specification of registers to write is represented in an instruction format.
 5. The microprocessor according to claim 4 whereby the specification of registers to write is delayed in a pipeline so as to be available on the same clock cycle as the results.
 6. The microprocessor according to claim 1 whereby the connectivity between execution units is known to code generation software tools.
 7. The microprocessor according to claim 1 whereby available execution units are specified in a library file.
 8. The microprocessor according to claim 7 whereby the connectivity of execution units to other units in the system is configurable.
 9. The microprocessor according to claim 8 whereby the number of output registers associated with an execution unit is configurable.
 10. The microprocessor according to claim 1 whereby the update of the result registers is dependent on global condition state for certain execution units.
 11. The microprocessor according to claim 10 whereby the state used to control the output register update is selectable as part of the instruction set.
 12. The microprocessor according to claim 1 whereby certain identity operations may be issued to an execution unit in order to perform a copy.
 13. The microprocessor according to claim 1 whereby the operation of certain bits with an execution word control certain execution units on a cycle by cycle basis.
 14. The microprocessor according to claim 13 whereby the number of bits required to control each execution unit varies depending upon the extent of its connectivity.
 15. The microprocessor according to claim 13 whereby certain bits within the execution word for each execution unit select different types of operation to be performed.
 16. The microprocessor according to claim 1 whereby each result register may be connected to one or more execution unit operands.
 17. The microprocessor according to claim 1 whereby a source register for a particular execution unit operand may be specified by the instruction set.
 18. The microprocessor according to claim 1 whereby the processor executes a sequence of contiguous execution words.
 19. The microprocessor according to claim 18 whereby, when the end of the execution word sequence is reached, execution may branch to one of a number of different execution word addresses.
 20. The microprocessor according to claim 19 whereby the same execution word sequence may be repeated to resolve a data hazard.
 21. The microprocessor according to claim 20 whereby there is a branch control unit for determining the destination of such branches.
 22. The microprocessor according to claim 21 whereby the branch control unit may accept branches out of their sequential order.
 23. The microprocessor according to claim 22 whereby the branch control unit may disable the operation of certain subsequent operations depending on the sequential position of an accepted branch.
 24. A method of operation used in a microprocessor with an architecture incorporating several execution units, whereby: (a) one or more registers store results from particular execution units; (b) execution unit operands receive data from one such register; and (c) certain execution units are able to copy data from their operands to result registers; and (d) the copy capability is used to allow execution units that are not directly connected to communicate data.
 25. (canceled)